Abstract

This paper defines a method for lexicon extraction in the
biomedical domain from comparable corpora.
The method is based on compositional transla- 
tion and exploits morpheme-level translation 
equivalences. It can generate translations for 
a large variety of morphologically constructed 
words and can also generate 'fertile' transla- 
tions. We show that fertile translations in- 
crease the overall quality of the extracted lex- 
icon for English to French translation. 

1 Introduction 

Comparable corpora are composed of texts in dif- 
ferent languages which are not translations but deal 
with the same subject matter and were produced in 
similar situations of communication so that there is 
a possibility to find translation pairs in the texts. 
Comparable corpora have been used mainly in the 
field of Cross-Language Information Retrieval and 
Computer-Aided Translation (CAT). In CAT, which 
is our field of application, comparable corpora have 
been used to extract domain-specific bilingual lexi- 
cons for language pairs or subject domains for which 
no parallel corpus is available. Another advantage
of comparable corpora is that they contain more id- 
iomatic expressions than parallel corpora do. In- 
deed, the target texts of parallel corpora are trans- 
lations and bear the influence of the source lan- 
guage whereas the target texts of comparable cor- 
pora are original, spontaneous productions. The 
main drawback of comparable corpora is that far
fewer translation pairs can be extracted than in parallel
corpora because (i) not all source language terms
have a translation in the target texts and (ii) when
there is a translation, it may not be present in its 
canonical form, precisely because the target texts 
are not translations. As observed by Baker (1996), 
translated texts tend to bear features like explica- 
tion, simplification, normalization and leveling out. 
For instance, an English-French comparable corpus
may contain the English term post-menopausal
but not its "normalized" or "canonical" translation
in French (post-ménopausique). However,
there might be some morphological or paraphrastic
variants in the French texts like post-ménopause
'post-menopause' or après la ménopause 'after
the menopause'. The solution that consists in
increasing the size of the corpus in order to
find more translation pairs or to extract parallel
segments of text (Fung and Cheung, 2004;
Rauf and Schwenk, 2009) is only possible when
large amounts of texts are available. In the case 
of the extraction of domain-specific lexicons, we 
quickly face the problem of data scarcity: in order to 
extract high-quality lexicons, the corpus must con- 
tain text dealing with very specific subject domains 
and the target and source texts must be highly com- 
parable. If one tries to increase the size of the cor- 
pus, one takes the risk of decreasing its quality by 
lowering its comparability or adding out-of-domain 
texts. Studies support the idea that the quality of the
corpus is more important than its size. Morin et al.
(2007) show that the discourse categorization of the 
documents increases the precision of the lexicon de- 
spite the data sparsity. Bo and Gaussier (2010) show 
that they improve the quality of a lexicon if they im- 
prove the comparability of the corpus by selecting 
a smaller - but more comparable - corpus from an
initial set of documents. Consequently, one solution
for increasing the number of translation pairs
is to focus on identifying translation variants. This 
paper explores the feasibility of identifying "fertile" 
translations in comparable corpora. In parallel text
processing, the notion of fertility has been defined
by Brown et al. (1993). They defined the fertility 
of a source word e as the number of target words 
to which e is connected in a randomly selected 
alignment. Similarly, we call a fertile translation a 
translation pair in which the target term has more 
words than the source term. We propose to identify 
such translations with a method mixing morphological
analysis and compositional translation: (i) the
source term is decomposed into morphemes:
post-menopausal is split into post- + menopause¹; (ii)
the morphemes are translated as bound morphemes
or fully autonomous words: post- becomes post- or
après and menopause becomes ménopause and (iii)
the translated elements are recomposed into a target
term: post-ménopause, après la ménopause.

This paper falls into 4 sections. Section 2 outlines
recent research in compositionality-based lexicon
extraction. Section 3 explains the algorithm
of morpho-compositional translation. Experimental
data and results are described in sections 4 and 5.

2 Related work 

Most of the research work in lexicon extraction 
from comparable corpora concentrates on same- 
length term alignment. To our knowledge, only 
Daille and Morin (2005) and Weller et al. (2011) 
tried to align terms of different lengths. Daille and 
Morin (2005) focus on the specific case of multi- 
word terms whose meanings are non-compositional 
and tried to align these multi-word terms with ei- 
ther single-word terms or multi-word terms using a 
context-based approach². Weller et al. (2011) concentrate
on aligning German NOUN-NOUN compounds to NOUN NOUN and
NOUN PREP NOUN structures in French and English.

¹ We use the following notations for morphemes: trailing hyphen
for prefixes (a-), leading hyphen for suffixes (-a), both for
confixes (-a-) and no hyphen for autonomous morphemes (a).
Morpheme boundaries are represented by a plus sign (+).

² Context-based methods were introduced by Rapp (1995)
and Fung (1997). They consist in comparing the contexts in
which the source and target terms occur. Their drawback is that
they need the source and target terms to be very frequent.

We chose to work in the framework of
compositionality-based translation because: (i)
compositional terms form more than 60% of
the new terms found in techno-scientific domains,
and especially in the field of biomedicine
(Namer and Baud, 2007); (ii) compositionality-based
methods have been shown to clearly
outperform context-based ones for the translation
of terms with compositional meaning
(Morin and Daille, 2010); (iii) we believe that
compositionality-based methods offer the opportunity
to generate fertile translations if combined with
a morphology-based approach.

2.1 Principle of compositional translation 

Compositional translation relies on the principle of
compositionality which states that "the meaning of
the whole is a function of the meaning of the parts"
(Keenan and Faltz, 1985, 24-25). Applied to bilingual
lexicon extraction, compositional translation
(CT) consists in decomposing the source term into
atomic components ($\mathcal{D}$), translating these components
into the target language ($\mathcal{T}$), recomposing the
translated components into target terms ($\mathcal{R}$) and finally
filtering the generated translations with a selection
function ($\mathcal{S}$):

\begin{align*}
CT(\text{"ab"}) &= \mathcal{S}(\mathcal{R}(\mathcal{T}(\mathcal{D}(\text{"ab"})))) \\
&= \mathcal{S}(\mathcal{R}(\mathcal{T}(\{a, b\}))) \\
&= \mathcal{S}(\mathcal{R}(\{\mathcal{T}(a) \times \mathcal{T}(b)\})) \\
&= \mathcal{S}(\mathcal{R}(\{A, B\})) \\
&= \mathcal{S}(\{A, B\}, \{B, A\}) \\
&= \text{"BA"}
\end{align*}

where "ab" is a source term composed of a and b, 
"BA" is a target term composed of B and A and there 
exists a bilingual resource linking a to A and b to B. 
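
As a rough illustration of this four-step pipeline, the following Python sketch chains decomposition, translation, recomposition and selection on the abstract "ab" example. The function name `compositional_translate`, the whitespace-based decomposition and the toy lexicon are assumptions made for this example only; they are not taken from any of the cited implementations.

```python
from itertools import permutations, product


def compositional_translate(term, lexicon, target_vocab):
    """Toy compositional translation of a whitespace-separated term.

    lexicon maps a source component to a list of target components;
    target_vocab is the set of target terms attested in the target texts.
    """
    parts = term.split()                      # decomposition
    if any(p not in lexicon for p in parts):  # an unknown part blocks translation
        return []
    candidates = set()
    # translation: Cartesian product of the translations of the parts
    for combo in product(*(lexicon[p] for p in parts)):
        # recomposition: try every ordering of the translated parts
        for perm in permutations(combo):
            candidates.add(" ".join(perm))
    # selection: keep only the candidates attested on the target side
    return sorted(c for c in candidates if c in target_vocab)


toy_lexicon = {"a": ["A"], "b": ["B"]}
print(compositional_translate("a b", toy_lexicon, {"B A"}))
```

Here the selection step is reduced to a set membership test; the implementations discussed below use corpora, classifiers or search engines instead.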

2.2 Implementations of compositional 
translation 

Existing implementations differ on the kind of 
atomic components they use for translation. 

Lexical compositional translation
(Grefenstette, 1999; Baldwin and Tanaka, 2004;
Robitaille et al., 2006; Morin and Daille, 2010)
deals with multi-word term to multi-word term
alignment and uses lexical words³ as atomic
components: rate of evaporation is translated
into French taux d'évaporation by translating
rate as taux and evaporation as évaporation
using dictionary lookup. Recomposition may
be done by permuting the translated components
(Morin and Daille, 2010) or with translation
patterns (Baldwin and Tanaka, 2004).

Sublexical compositional translation deals with 
single-word term translation. The atomic compo- 
nents are subparts of the source single-word term. 
Cartoni (2009) translates neologisms created by
prefixation with a special formalism called Bilingual
Lexeme Formation Rules. Atomic components
are the prefix and the lexical base: Italian
neologism anticonstituzionale 'anticonstitution' is
translated into French anticonstitution by translating
the prefix anti- as anti- and the lexical base
constituzionale as constitution. Weller et al. (2011)
translate two types of single-word terms. German
single-word terms formed by the concatenation of
two neoclassical roots are decomposed into these
two roots, then the roots are translated into target
language roots and recomposed into an English
or French single-word term, e.g. Kalori₁metrie₂
is translated as calori₁metry₂. German NOUN₁NOUN₂
compounds are translated into French and
English NOUN₁ NOUN₂ or NOUN₁ PREP NOUN₂
multi-word terms, e.g. Elektronen₁-mikroskop₂ is
translated as electron₁ microscope₂.

2.3 Challenges of compositional translation 

Compositional translation faces four main challenges
which are (i) morphosyntactic variation:
source and target terms' morphosyntactic structures
are different: anti-cancer (noun) → anti-cancéreux (adj)
'anti-cancerous'; (ii) lexical variation: source and
target terms contain semantically related - but not
equivalent - words: machine translation → traduction
automatique 'automatic translation'; (iii)
fertility: the target term has more content words
than the source term: isothermal snowpack →
manteau neigeux isotherme 'isothermal snow mantle';
(iv) terminological variation: a source term
can be translated as different target terms:
oophorectomy → ovariectomie 'oophorectomy', ablation des
ovaires 'removal of the ovaries'.

Solutions to morphosyntactic, lexical and
to some extent terminological variation have
been proposed in the form of thesaurus lookup
(Robitaille et al., 2006), morphological derivation
rules (Morin and Daille, 2010), morphological variant
dictionaries (Cartoni, 2009) or morphosyntactic
translation patterns (Baldwin and Tanaka, 2004;
Weller et al., 2011). Fertility has been addressed by
Weller et al. (2011) for the specific case of German
NOUN-NOUN compounds.

³ As opposed to grammatical words: prepositions, determiners,
etc.

3 Morpho-compositional translation 

3.1 Underlying assumptions 

Morpho-compositional translation relies on the following
assumptions:

Lexical subcompositionality. The lexical items 
which compose a multi-word term or a single-word 
term may be split into semantically-atomic compo- 
nents. These components may be either free (i.e. 
they can occur in texts as autonomous lexical items 
like toxicity in cardiotoxicity) or bound (i.e. they 
cannot occur as autonomous lexical items, in that 
case they correspond to bound morphemes like
-cardio- in cardiotoxicity).

Irrelevance of the bound/free feature in translation.
Translation occurs regardless of the components'
degree of freedom: -cardio- may be translated
as cœur 'heart' as in cardiotoxicity → toxicité pour
le cœur 'toxicity to the heart'.

Irrelevance of the bound/free feature in allomorphy.
Allomorphy happens regardless of the
components' degree of freedom: -cardio-, cœur
'heart', cardiaque 'cardiac' are possible instantiations
of the same abstract component and may lead
to terminological variation as in cardiotoxicity →
cardiotoxicité 'cardiotoxicity', toxicité pour le cœur
'toxicity to the heart', toxicité cardiaque 'cardiac
toxicity'.

Like other sublexical approaches, the main idea 
behind morpho-compositional translation is to go 
beyond the word level and work with subword 
components. In our case, these components are 
morpheme-like items which either (i) bear referential
lexical meaning like confixes⁴ (-cyto-, -bio-, -ectomy-)
and autonomous lexical items (cancer, toxicity)
or (ii) can substantially change the
meaning of a word, especially prefixes (anti-, post-, co-...)
and some suffixes (-less, -like...). Unlike
other approaches, morpho-compositional translation
is not limited to a small set of source-to-target
structure equivalences. It takes as input a single
morphologically constructed word unit which can
be the result of prefixation (pretreatment), confixation
(densitometry), suffixation (childless), compounding
(anastrozole-associated) or any combination
of the four. It outputs a list of n words
which may or may not be morphologically constructed.
For instance, postoophorectomy may be
translated as postovariectomie 'postoophorectomy',
après l'ovariectomie 'after the oophorectomy' or
après l'ablation des ovaires 'after the removal of the
ovaries'.

3.2 Algorithm 

As an example, we show the translation of the adjective
cytotoxic into French using a toy dataset.
Let $Comp^{l}_{type}$ be a list of components in language
$l$ where $type$ equals $pref$ for prefixes, $conf$ for
confixes, $suff$ for suffixes and $free$ for free lexical
units; $Trans$ be the translation table which
maps source and target components; $Var^{l}$ be a
table mapping related lexical units in language $l$;
$Stop^{l}$ a list of stopwords in language $l$; $Corpus^{l}$
a lemmatized, pos-tagged corpus in language $l$:

$Comp^{en}_{conf}$ = {-cyto-} ;
$Comp^{en}_{free}$ = {cytotoxic, cytotoxicity, toxic} ;
$Comp^{fr}_{conf}$ = {-cyto-} ;
$Comp^{fr}_{free}$ = {cellule, toxique} ;
$Trans$ = {{-cyto- → -cyto-, cellule},
           {toxic → toxique}} ;
$Var^{en}$ = {cytotoxic → cytotoxicity} ;
$Stop^{fr}$ = {pour, le} ;
$Corpus^{fr}$ = "le/DET cytotoxicité/N être/AUX le/DET
propriété/N de/PREP ce/DET qui/PRO être/AUX
toxique/A pour/PREP le/DET cellule/N ./PUN" ;
'The cytotoxicity is the property of what is toxic to
the cells.'

Morpho-compositional translation takes as input a
source language single-word term and outputs zero
or several target language single-word terms or
multi-word terms. It is the result of the sequential
application of four functions to the input single-word
term: decomposition ($\mathcal{D}$), translation ($\mathcal{T}$),
recomposition ($\mathcal{R}$) and selection ($\mathcal{S}$).

⁴ We use the term confix as a synonym of neoclassical roots
(Latin or Ancient Greek root words).

3.2.1 Decomposition function 

The decomposition function $\mathcal{D}$ works in two steps
$\mathcal{D}_1$ and $\mathcal{D}_2$.

Step 1 of decomposition ($\mathcal{D}_1$) splits the input
single-word term into minimal components by
matching substrings of the single-word term with
the resources $Comp^{src}_{pref}$, $Comp^{src}_{conf}$, $Comp^{src}_{suff}$,
$Comp^{src}_{free}$ and respecting some length constraints
on the substrings. For example, one may split a
single-word term $SWT_{1,n}$ of $n$ characters into prefix
$Prefix_{1,i}$ and lexical base $LexBase_{i+1,n}$ provided
that $SWT_{1,i} \in Comp^{src}_{pref}$ and $SWT_{i+1,n} \in
Comp^{src}_{free}$ and $n - i > \mathcal{L}_0$; $\mathcal{L}_0$ being empirically
set to 5. A single-word term is first split into an optional
prefix + base₁, then base₁ is split into base₂ +
optional suffix, then base₂ is split into one or several
confixes or lexical items. When several splittings
are possible, only the ones with the highest number
of minimal components are retained.

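\begin{align*}
&\mathcal{S}(\mathcal{R}(\mathcal{T}(\mathcal{D}_2(\mathcal{D}_1(\text{"cytotoxic"}))))) \\
&= \mathcal{S}(\mathcal{R}(\mathcal{T}(\mathcal{D}_2(\{\text{cyto}, \text{toxic}\}))))
\end{align*}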
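
To make the length constraint of step 1 concrete, here is a minimal Python sketch of the prefix + lexical base split only. The function name `split_prefix`, the parameter `min_base_len` (standing for the threshold set to 5 in the text) and the tiny resource sets are assumptions of this example; bound morphemes are written without hyphens for simplicity. The suffix and confix splits, and the preference for the splitting with the most components, are not covered.

```python
def split_prefix(word, prefixes, free_items, min_base_len=5):
    """Return all (prefix, base) splits of `word` such that the prefix is a
    known prefix, the base is a known free lexical item, and the base is
    strictly longer than `min_base_len` characters."""
    splits = []
    n = len(word)
    for i in range(1, n):
        prefix, base = word[:i], word[i:]
        if prefix in prefixes and base in free_items and n - i > min_base_len:
            splits.append((prefix, base))
    return splits


# Hypothetical toy resources
toy_prefixes = {"pre", "post"}
toy_free = {"treatment", "er"}

print(split_prefix("pretreatment", toy_prefixes, toy_free))  # base has 9 characters
print(split_prefix("poster", toy_prefixes, toy_free))        # base "er" is too short
```

The length threshold is what keeps a word such as poster from being analysed as post- + er even when both substrings happen to be listed in the resources.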

Step 2 of decomposition ($\mathcal{D}_2$) gives out all
possible decompositions of the single-word term
by enumerating the different concatenations of its
minimal components. For example, if single-word
term "abc" has been split into minimal components
{a,b,c}, then it has 4 possible decompositions:
{abc}, {a,bc}, {ab,c}, {a,b,c}. For a single-word
term having $n$ minimal components, there exist
$2^{n-1}$ possible decompositions.

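\begin{align*}
&\mathcal{S}(\mathcal{R}(\mathcal{T}(\mathcal{D}_2(\{\text{cyto}, \text{toxic}\})))) \\
&= \mathcal{S}(\mathcal{R}(\mathcal{T}(\{\text{cyto}, \text{toxic}\}, \{\text{cytotoxic}\})))
\end{align*}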

The concatenation of the minimal components 
into bigger components increases the chances of 
finding translations. For example, consider the 
single-word term non-cytotoxic and a dictionary 
having translations for non, cyto and cytotoxic but 
no translation for toxic. If we stick to the sole output
of $\mathcal{D}_1$ {non-,-cyto-,toxic}, the translation of
non-cytotoxic will fail because there is no translation for
toxic. Whereas if we also consider the output of $\mathcal{D}_2$
which contains the decomposition {non-,cytotoxic},
we will be able to translate non-cytotoxic because
the dictionary has an entry for cytotoxic.
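
The enumeration performed in step 2 can be sketched as follows: each of the n - 1 boundaries between adjacent minimal components is either kept as a cut or merged, which yields the 2^(n-1) decompositions mentioned above. The function name `concatenations` is an assumption of this sketch, and components are written without hyphens.

```python
from itertools import product


def concatenations(components):
    """Enumerate every way of merging adjacent components, keeping their order.

    For n components there are n - 1 boundaries, each either cut or merged,
    hence 2 ** (n - 1) results."""
    if not components:
        return []
    results = []
    for cuts in product([False, True], repeat=len(components) - 1):
        groups, current = [], components[0]
        for cut, comp in zip(cuts, components[1:]):
            if cut:
                groups.append(current)  # close the current group at this boundary
                current = comp
            else:
                current += comp         # merge with the previous component
        groups.append(current)
        results.append(groups)
    return results


for decomposition in concatenations(["non", "cyto", "toxic"]):
    print(decomposition)
```

For the non-cytotoxic example, the output contains the decomposition made of non and cytotoxic, which is the one that can be translated when the dictionary has no entry for toxic.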

3.2.2 Translation function 

The translation function $\mathcal{T}$ provides translations
for each decomposition output by $\mathcal{D}$. Applying the
compositionality principle to translation, we consider
that the translation of the whole is a function of
the translation of the parts: $\mathcal{T}(a, b) = \mathcal{T}(a) \times \mathcal{T}(b)$.
For a given decomposition $\{c_1, ... c_n\}$ having $n$
components, there exist $\prod_{i=1}^{n} |\mathcal{T}(c_i)|$ possible
translations. Components' translations are obtained
using the $Trans$ and $Var$ resources: $\mathcal{T}(c) =
Trans(c) \cup Trans(Var^{src}(c)) \cup Var^{tgt}(Trans(c))$.
If one of the components cannot be translated, the
translation of the whole decomposition fails.

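\begin{align*}
&\mathcal{S}(\mathcal{R}(\mathcal{T}(\{\text{cyto}, \text{toxic}\}, \{\text{cytotoxic}\}))) \\
&= \mathcal{S}(\mathcal{R}(\mathcal{T}(\text{cyto}) \times \mathcal{T}(\text{toxic}), \mathcal{T}(\text{cytotoxic}))) \\
&= \mathcal{S}(\mathcal{R}(\{\text{cyto}, \text{toxique}\}, \{\text{cellule}, \text{toxique}\}, \{\text{cytotoxicité}\}))
\end{align*}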
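
The translation step can be sketched in Python as a union of three lookups per component followed by a Cartesian product over the components. The function names and the dictionary-based representation of the Trans and Var resources are assumptions of this sketch. The entry cytotoxicity → cytotoxicité is a hypothetical addition to the toy translation table, made only so that the variant lookup has something to return.

```python
from itertools import product


def translate_component(c, trans, var_src, var_tgt):
    """Direct translations, translations of source variants of c,
    and target variants of the direct translations of c."""
    direct = set(trans.get(c, ()))
    via_src = {t for v in var_src.get(c, ()) for t in trans.get(v, ())}
    via_tgt = {v for t in direct for v in var_tgt.get(t, ())}
    return direct | via_src | via_tgt


def translate_decomposition(components, trans, var_src, var_tgt):
    """Cartesian product of the component translations; an empty list is
    returned as soon as one component has no translation."""
    options = [translate_component(c, trans, var_src, var_tgt) for c in components]
    if any(not o for o in options):
        return []
    return [list(combo) for combo in product(*(sorted(o) for o in options))]


toy_trans = {
    "cyto": ["cyto", "cellule"],
    "toxic": ["toxique"],
    "cytotoxicity": ["cytotoxicité"],  # hypothetical entry, not in the toy table above
}
toy_var_en = {"cytotoxic": ["cytotoxicity"]}

print(translate_decomposition(["cyto", "toxic"], toy_trans, toy_var_en, {}))
print(translate_decomposition(["cytotoxic"], toy_trans, toy_var_en, {}))
```

The number of translations returned for a decomposition is the product of the numbers of translations of its components, here 2 × 1 for cyto and toxic.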

3.2.3 Recomposition function 

The recomposition function $\mathcal{R}$ takes as input the
translations output by $\mathcal{T}$ and recomposes them
into sequences of one or several lexical items. It
takes place in two steps.

Step 1 of recomposition ($\mathcal{R}_1$) generates, for a
given translation of $n$ items, all of the $n!$ possible
permutations of these items. As a general rule,
$O(n!)$ procedures should be avoided but we are permuting
small sets (up to 4 items). This captures the
fact that components' order may be different in the
source and target language (distortion). Once the
components have been permuted, we generate, for
each permutation, all the different concatenations of
its components into lexical items (as is done in
step 2 of decomposition).

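\begin{align*}
&\mathcal{S}(\mathcal{R}_2(\mathcal{R}_1(\{\text{cyto}, \text{toxique}\}, \{\text{cellule}, \text{toxique}\}, \{\text{cytotoxicité}\}))) \\
&= \mathcal{S}(\mathcal{R}_2(\{\text{cyto}, \text{toxique}\}, \{\text{cytotoxique}\}, \{\text{toxique}, \text{cyto}\}, \{\text{toxiquecyto}\}, \\
&\qquad \{\text{cellule}, \text{toxique}\}, \{\text{celluletoxique}\}, \{\text{toxique}, \text{cellule}\}, \{\text{toxiquecellule}\}, \{\text{cytotoxicité}\}))
\end{align*}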
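
A small Python sketch of this first recomposition step, under the assumption that concatenation is plain string concatenation without any spelling adjustment: every permutation of the translated items is generated, then every way of merging adjacent items. The function names `concatenations` and `recompose` are choices made for this example.

```python
from itertools import permutations, product


def concatenations(items):
    """All ways of merging adjacent items while keeping their order."""
    if not items:
        return []
    results = []
    for cuts in product([False, True], repeat=len(items) - 1):
        groups, current = [], items[0]
        for cut, item in zip(cuts, items[1:]):
            if cut:
                groups.append(current)
                current = item
            else:
                current += item
        groups.append(current)
        results.append(groups)
    return results


def recompose(items):
    """Permute the items (n! orderings), then enumerate the concatenations of
    each ordering; duplicates are removed while preserving order."""
    seen, output = set(), []
    for perm in permutations(items):
        for groups in concatenations(list(perm)):
            key = tuple(groups)
            if key not in seen:
                seen.add(key)
                output.append(groups)
    return output


for sequence in recompose(["cyto", "toxique"]):
    print(sequence)
```

With two items the sketch produces four sequences, which correspond to the four recompositions listed for {cyto, toxique} in the derivation above.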

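Step 2 of recomposition ($\mathcal{R}_2$) filters out the
output of $\mathcal{R}_1$ using heuristic rules. For example,
a sequence of lexical items $L = \{l_1, ... l_n\}$ would
be filtered out provided that $\exists\, l \in L \mid l \in
Comp^{tgt}_{pref} \cup Comp^{tgt}_{conf} \cup Comp^{tgt}_{suff}$, i.e. recomposition
{cytotoxique} would be accepted but not
{-cyto-, toxique} because -cyto- is a bound component
(it should not appear as an autonomous lexical
item).

\begin{align*}
&\mathcal{S}(\mathcal{R}_2(\{\text{cyto}, \text{toxique}\}, \{\text{cytotoxique}\}, \{\text{toxique}, \text{cyto}\}, \{\text{toxiquecyto}\}, \\
&\qquad \{\text{cellule}, \text{toxique}\}, \{\text{celluletoxique}\}, \{\text{toxique}, \text{cellule}\}, \{\text{toxiquecellule}\}, \{\text{cytotoxicité}\})) \\
&= \mathcal{S}(\{\text{cytotoxique}\}, \{\text{toxiquecyto}\}, \{\text{cellule}, \text{toxique}\}, \{\text{celluletoxique}\}, \\
&\qquad \{\text{toxique}, \text{cellule}\}, \{\text{toxiquecellule}\}, \{\text{cytotoxicité}\})
\end{align*}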

These concatenations correspond to the final lexical
units which will be matched against the target
corpus with the selection function. For example,
the concatenation $\{toxique_A, cellule_B\}$ corresponds
to a translation made of two distinct lexical
items: toxique followed by cellule. The concatenation
$\{cytotoxique_{AB}\}$ corresponds to only one lexical
item: cytotoxique.
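
The bound-component heuristic of the second recomposition step can be illustrated with a short filter. The function name `filter_recompositions` and the representation of bound components as a set of hyphen-less strings are assumptions of this sketch; the text presents this rule only as one example of the heuristics used.

```python
def filter_recompositions(recompositions, bound_components):
    """Discard any sequence in which a bound component (prefix, confix or
    suffix) would surface as an autonomous lexical item."""
    return [
        seq for seq in recompositions
        if not any(item in bound_components for item in seq)
    ]


candidates = [
    ["cyto", "toxique"], ["cytotoxique"], ["toxique", "cyto"], ["toxiquecyto"],
    ["cellule", "toxique"], ["celluletoxique"], ["toxique", "cellule"],
    ["toxiquecellule"], ["cytotoxicité"],
]
for sequence in filter_recompositions(candidates, {"cyto"}):
    print(sequence)
```

On the toy example, the two sequences in which cyto stands alone are removed and the seven remaining sequences are passed on to the selection function.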

3.2.4 Selection function 

The selection function $\mathcal{S}$ tries to match the sequences
of lexical items output by $\mathcal{R}$ with the
lemmas of the tokens of the target corpus. We call
$T = \{t_1, ... t_m\}$ a sequence of tokens from the target
corpus, $l(t_k)$ the lemma of token $t_k$ and $p(t_k)$ the
part-of-speech of token $t_k$. We call $L = \{l_1, ... l_n\}$ a
sequence of lexical items output by $\mathcal{R}$. $L$ matches
$T$ if there exists a strictly increasing sequence of
indices $I = \{i_1, ... i_n\}$ such that $l(t_{i_j}) = l_j$,
$\forall j, 1 \leq j \leq n$ and $\forall j, 1 \leq |i_{j-1} - i_j| \leq \mathcal{L}_1$ and
$\forall t_k \mid k \notin I, l(t_k) \in Stop^{tgt}$; $\mathcal{L}_1$ being empirically
set to 3.

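\begin{align*}
&\mathcal{S}(\{\text{cytotoxique}\}, \{\text{toxiquecyto}\}, \{\text{cellule}, \text{toxique}\}, \{\text{celluletoxique}\}, \\
&\qquad \{\text{toxique}, \text{cellule}\}, \{\text{toxiquecellule}\}, \{\text{cytotoxicité}\}) \\
&= \text{"cytotoxicité/N"}, \text{"toxique/A pour/PREP le/DET cellule/N"}
\end{align*}

'cytotoxicity', 'toxic to the cells'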

In other words, $L$ is a subsequence of the lemmas of
$T$ and we allow at maximum $\mathcal{L}_1$ closed-class words
between two tokens which match the lemmas of $L$.
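
The matching procedure can be sketched as a scan over the lemmatized target corpus. This sketch follows the prose reading above: at most `max_gap` stopwords (the threshold set to 3 in the text) may occur between two consecutive matched tokens, and no other token may intervene. The function name `find_matches`, the (lemma, pos) tuple representation and the greedy left-to-right strategy are assumptions of this example.

```python
def find_matches(lexical_items, tokens, stopwords, max_gap=3):
    """Return the token spans of `tokens` (a list of (lemma, pos) pairs) whose
    lemmas contain `lexical_items` in order, separated only by stopwords and
    by at most `max_gap` stopwords between two consecutive items."""
    if not lexical_items:
        return []
    matches = []
    for start, (lemma, _) in enumerate(tokens):
        if lemma != lexical_items[0]:
            continue
        pos, ok = start, True
        for item in lexical_items[1:]:
            gap, nxt = 0, pos + 1
            # skip over stopwords until the next item or the gap limit
            while (nxt < len(tokens) and tokens[nxt][0] != item
                   and tokens[nxt][0] in stopwords and gap < max_gap):
                gap += 1
                nxt += 1
            if nxt < len(tokens) and tokens[nxt][0] == item:
                pos = nxt
            else:
                ok = False
                break
        if ok:
            matches.append(tokens[start:pos + 1])
    return matches


corpus = [("le", "DET"), ("cytotoxicité", "N"), ("être", "AUX"), ("le", "DET"),
          ("propriété", "N"), ("de", "PREP"), ("ce", "DET"), ("qui", "PRO"),
          ("être", "AUX"), ("toxique", "A"), ("pour", "PREP"), ("le", "DET"),
          ("cellule", "N"), (".", "PUN")]
stopwords = {"pour", "le"}

print(find_matches(["toxique", "cellule"], corpus, stopwords))
print(find_matches(["cytotoxicité"], corpus, stopwords))
print(find_matches(["cellule", "toxique"], corpus, stopwords))
```

On the toy corpus, only the sequences cytotoxicité and toxique followed by cellule are matched, which agrees with the two candidate translations retained in the example.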

For a given sequence of lexical items $L$, we collect
from the target corpus all sequences of tokens
$T_1, T_2, ... T_p$ which match $L$ according to our
above-mentioned definition. We consider two sequences
$T1$ and $T2$ to be equivalent candidate translations if
$|T1| = |T2|$ and $\forall (t1_i, t2_j)$ such that $t1_i \in T1$, $t2_j \in
T2$, $i = j$ then $l(t1_i) = l(t2_j)$ and $p(t1_i) = p(t2_j)$,
i.e. if two sequences of tokens correspond to the
same sequence of (lemma, pos) pairs, these two sequences
are considered as the same candidate translation.
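
This equivalence can be implemented by using the sequence of (lemma, pos) pairs as a grouping key. In the sketch below, tokens are represented as (surface form, lemma, pos) triples and the function name `group_candidates` is hypothetical; counting the occurrences of each key is an addition of this example and is not described in the text.

```python
from collections import Counter


def group_candidates(matched_sequences):
    """Group matched token sequences by their (lemma, pos) signature.

    Each token is a (surface, lemma, pos) triple; two sequences with the same
    signature are counted as the same candidate translation."""
    return Counter(
        tuple((lemma, pos) for _, lemma, pos in sequence)
        for sequence in matched_sequences
    )


occurrences = [
    [("toxiques", "toxique", "A"), ("pour", "pour", "PREP"),
     ("les", "le", "DET"), ("cellules", "cellule", "N")],
    [("toxique", "toxique", "A"), ("pour", "pour", "PREP"),
     ("la", "le", "DET"), ("cellule", "cellule", "N")],
    [("cytotoxicité", "cytotoxicité", "N")],
]
for signature, count in group_candidates(occurrences).items():
    print(count, signature)
```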

4 Experimental data 

We worked with three languages: English as source 
language and French and German as target lan- 
guages. 

4.1 Corpora 

Our corpus is composed of specialized texts from
the medical domain dealing with breast cancer.
We define specialized texts as texts being
produced by domain experts and directed towards
either an expert or a non-expert readership
(Bowker and Pearson, 2002). The texts were manually
collected from scientific papers portals and from
information websites targeted to breast cancer patients
and their relatives. Each corpus has approximately
400k words (cf. Table 1). All the texts were
pos-tagged and lemmatized using the linguistic analysis
suite XELDA⁵. We also computed the comparability
of the corpora. We used the comparability
measure defined by Bo and Gaussier (2010) which
indicates, given a bilingual dictionary, the expectation
of finding for each source word of the source
corpus its translation in the target corpus and vice-versa.
The English-French corpus' comparability is
0.71 and the English-German corpus' comparability
is 0.45. The difference in comparability can be explained
by the fact that German texts on breast cancer
were hard to find (especially scientific papers):
we had to collect texts in which breast cancer was
not the main topic.



Readership     EN        FR         DE
experts        218.3k    267.2k     197.2k
non-experts    198.2k    184.5k     201.7k
TOTAL          416.5k    451.75k    398.9k



4.2 Source terms 

We tested our algorithm on a set of source terms 
extracted from the English texts. The extraction 
was done in a semi-supervised manner. Step 1: 
We wrote a short seed list of English bound mor- 
phemes. We automatically extracted from the En- 
glish texts all the words that contained these mor- 
phemes. For example, we extracted the words 
postchemotherapy and poster because they con- 
tained the string post- which corresponds to a bound 
morpheme of English. Step 2: The extracted words 
were sorted : those which were not morphologically 
constructed were eliminated (like poster), and those 
which were morphologically constructed were kept 
(like postchemotherapy). The morphologically con- 
structed words were manually split into morphemes. 
For example, postchemotherapy was split into post- 
, -chemo- and therapy. Step 3: If some bound 
morphemes which were not in the initial seed list 
were found when we split the words during step 2, 
we started the whole process again, using the new 
bound morphemes to extract new morphologically 
constructed words. We also added hyphenated terms 
like ER-positive to our list of source terms. 

We obtained a set of 2025 English terms with this
procedure. For our experiments, we excluded from 
this set the source terms which had a translation in 
the general language dictionary and whose transla- 
tion was present in the target texts. The final test 
set for English-to-French experiments contains 1 839 
morphologically constructed source terms. The 
test set for English-to-German contains 1 824 source 
terms. 

4.3 Resources used in the translation step T 

Tables 2 and 3 show the size of the resources we
used for translation. 

General language dictionary We used the gen- 
eral language dictionary which is part of the linguis- 
tic analysis suite Xelda. 

Domain-specific dictionary We built this resource
automatically by extracting pairs of cognates
from the comparable corpora. We used the same
technique as Hauer and Kondrak (2011): an SVM
classifier trained on examples taken from online
dictionaries⁶.

Table 1: Composition and size of corpora in words

⁵ http://www.temis.com

⁶ http://www.dicts.info/uddl.php



Morpheme translation table To our knowledge, 
there exists no publicly available morphology-based 
bilingual dictionary. Consequently, we asked trans- 
lators to create an ad hoc morpheme translation table 
for our experiment. This morpheme translation table 
links the English bound morphemes contained in the 
source terms to their French or German equivalents. 
The equivalents can be bound morphemes or lexical 
items. 

In order to handle the variation phenomena described
in section 2.3, we used a dictionary of synonyms
and lists of morphologically related words.

The dictionary of synonyms is the one which is part of the
Xelda linguistic analyzer. The lists of morphologically
related words were built by stemming the
words of the comparable corpora and the entries of
the bilingual dictionary with a simple stemming algorithm
(Porter, 1980).





                     EN→FR        EN→DE
General language     38k→60k      38k→70k
Domain-specific      6.7k→6.7k    6.4k→6.4k
Morphemes (TOTAL)    242→729      242→761
  prefixes           50→134       50→166
  confixes           185→574      185→563
  suffixes           7→21         7→32

Table 2: Nb. of entries in the multilingual resources





             EN→EN        FR→FR        DE→DE
Synonyms     5.1k→7.6k    2.4k→3.2k    4.2k→4.9k
Morphol.     5.9k→15k     7.1k→18k     7.4k→16k

Table 3: Nb. of entries in the monolingual resources



4.4 Resources used in the decomposition step ($\mathcal{D}$)

The decomposition function uses the entries of
the bound morphemes translation table (242 entries)
and a list of 85k lexical items composed
of the entries of the general language dictionary
and English words extracted from the Leipzig Corpus
(Quasthoff et al., 2006), which is a general language
corpus.

5 Evaluation

5.1 Evaluation metrics

As explained in section 2.2, compositional translation
consists in generating candidate translations.
These candidate translations can be filtered
out with a classifier (Baldwin and Tanaka, 2004),
by keeping only the translations which occur in
the target texts of the corpus (Weller et al., 2011;
Morin and Daille, 2010) or by using a search engine
(Robitaille et al., 2006). Unlike alignment evaluation
in parallel texts, there is no reference alignment
to which the selected translations can be compared
and we cannot use standard evaluation metrics
like AER (Och and Ney, 2000). It is also difficult
to find reference lexicons in specific domains
since the goal of the extraction process is to create
such lexicons. Furthermore, we also wish to
evaluate if the algorithm can identify non-canonical
translations which, by definition, cannot be found
in a reference lexicon. Usually, the candidate translations
are annotated manually as correct or incorrect
by native speakers. Baldwin and Tanaka (2004)
use two standards for evaluation: gold-standard and
silver-standard. Gold-standard is the set of candidate
translations which correspond to canonical, reference
translations. Silver-standard corresponds to
the gold-standard translations plus the translations
which "capture the basic semantics of the source
language expression and allow the source language
expression to be recovered with reasonable confidence"
(op. cit.).

The first evaluation metric is the precision $P$
which is the number of correct candidate translations
$|Corr|$ over the total number of generated candidate
translations $|A|$: $P = \frac{|Corr|}{|A|}$. In addition to
precision, we propose to indicate the coverage $C$ of
the lexicon, i.e. the proportion of source terms ($ST$)
which obtained at least one candidate translation regardless
of its accuracy:

\[ C = \frac{\sum_{i=1}^{|ST|} \alpha(ST_i)}{|ST|} \]

where $\alpha(ST_i)$ returns 1 if $|A(ST_i)| \geq 1$ else 0. As
augmenting coverage tends to lower precision, we
also compute $OQ$, the overall quality of the lexicon,
to get an idea of the coverage/precision trade-off:
$OQ = P \times C$.
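
The three scores can be computed with a few lines of Python. In this sketch, the function name `evaluate`, the dictionary that maps each source term to its list of candidate translations and the set of (source, candidate) pairs judged correct are all assumptions made for the example; the placeholder terms and judgements are invented and are not experimental data.

```python
def evaluate(candidates, correct_pairs):
    """Return (precision, coverage, overall quality).

    candidates: dict mapping each source term to its list of candidate
                translations (possibly empty);
    correct_pairs: set of (source term, candidate) pairs annotated as correct."""
    all_pairs = [(s, c) for s, cands in candidates.items() for c in cands]
    precision = (sum(pair in correct_pairs for pair in all_pairs) / len(all_pairs)
                 if all_pairs else 0.0)
    coverage = (sum(1 for cands in candidates.values() if cands) / len(candidates)
                if candidates else 0.0)
    return precision, coverage, precision * coverage


# Invented placeholder data: 4 source terms, 2 of them with candidates
toy_candidates = {"t1": ["a", "b"], "t2": ["c", "d"], "t3": [], "t4": []}
toy_correct = {("t1", "a"), ("t1", "b"), ("t2", "c")}

p, c, oq = evaluate(toy_candidates, toy_correct)
print(p, c, oq)
```

With three correct candidates out of four and two covered source terms out of four, the sketch gives a precision of 0.75, a coverage of 0.5 and an overall quality of 0.375.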



5.2 Results 

Compositional-translation methods give better re- 
sults when they are applied to general language texts 
rather than domain-specific texts. This is due to the 
fact that the translations of the components can be 
easily found in dictionaries since they belong to the 
general language and it is also easier to collect large 
corpora. Working with general language texts, 
Baldwin and Tanaka (2004) were able to generate
candidate translations for 92% of their source terms
and they report 43% (gold-standard) to 84% (silver-standard)
of correct translations. The size of
their corpus exceeds 80M words for each language.
Cartoni (2009) works on the translation of prefixed
Italian neologisms into French. He considers that
the generated neologisms have a "confirmed existence"
if they occur more than five times on the Internet.
He finds that between 42% and 94% of the
generated neologisms fall into that category. Regarding
domain-specific translation, Robitaille et
al. (2006) use a search engine to build a corpus from
the web and incrementally collect translation pairs.
They start with a list of 9.6 pairs (on average) with 
a precision of 92% and end up with a final output 
of 19.6 pairs on average with a precision of 81%. 
Morin and Daille (2009) could generate candidate 
translations for 15% of their source terms and they 
report 88% of correct alignments. The size of their 
corpus is 700k words per language. Weller et al. 
(2011) were able to generate 8% of correct French 
translations and 18% of correct English translations 
for their 2000 German compounds. Their corpus 
contains approximately 1.5M words per language. 

We ran the morpho-compositional translation prototype
on the set of source terms described in
section 4.2. The output candidate translations
were manually annotated by two translators. Like
Baldwin and Tanaka (2004), we used three annotation
values: canonical translation, recoverable
translation and incorrect. In our case, recoverable
translations correspond to paraphrastic and
morphological translation variants. For example,
the canonical translation for post-menopausal
is post-ménopausique. Recoverable translations
are post-ménopause 'post-menopause' and après la
ménopause 'after the menopause'. Fertile translations
can be canonical translations if a non-fertile
translation would have been more awkward. For
example, the canonical translation for oestrogen-sensitive
is sensible aux œstrogènes 'sensitive to oestrogens'.
A non-fertile translation would sound
very unnatural. We computed inter-annotator agreement
on a set of 100 randomly selected candidate
translations. We used the Kappa statistics
(Carletta, 1996) and obtained a high agreement
(0.77 for English to German translations and 0.71
for English to French).
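
For readers unfamiliar with the Kappa statistic, the sketch below computes agreement between two annotators, assuming the two-annotator Cohen formulation (observed agreement corrected by the agreement expected by chance). The text cites Carletta (1996) without giving the formula, so the exact variant used in the experiments is an assumption here. The function name `cohen_kappa` and the labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumes both lists have the same non-zero length and that chance
    agreement is below 1 (otherwise the ratio is undefined)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from each annotator's label distribution
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented annotations: "c" for correct, "i" for incorrect
annotator_1 = ["c", "c", "i", "i"]
annotator_2 = ["c", "c", "i", "c"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this invented example the annotators agree on three items out of four while chance agreement is 0.5, which gives a kappa of 0.5.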

First, we tested the impact of the linguistic
resources described in section 4.3 (B for Baseline
dictionaries, D for Domain-specific dictionary, S
for Synonyms, M for Morphologically related words).
We also tested a simple Prefix+lemma translation
(Pref), in a similar vein to the work of Cartoni
(2009), to serve as a point of comparison with our
method. The results are given in tables 4 and 5.
The best results in terms of overall quality are
obtained with the combination of the baseline and
domain-specific dictionaries (BD). Morphologically
related words and synonyms increase coverage at the
cost of precision. Regarding English-to-French
translations, we were able to generate translations
for 26% of the source terms. The gold-standard
precision is 60% and the silver-standard precision
is 67%. Regarding English-to-German translations,
we were able to generate translations for 26% of
the source terms. The gold-standard precision is
39% and the silver-standard precision is 43%. The
prefix+lemma translation method has a very high
precision (between 76% and 84%) but produces very
few translations (between 1% and 2%). Coverage and
precision scores compare well with other
approaches, given that we have very small
domain-specific corpora (400k words per language)
and that our approach deals with a large number of
morphological constructions. The lower quality of
the German translations can be explained by the
fact that the English-German corpus is much less
comparable than the English-French corpus (0.45 vs.
0.71).
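
The scores discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the evaluation script itself: we assume that coverage is the share of source terms with at least one candidate translation, that gold-standard precision accepts only canonical translations, that silver-standard precision also accepts recoverable ones, and that precision is computed over the translated source terms. The function name and data layout are hypothetical.

```python
def lexicon_scores(source_terms, annotated_candidates):
    """Coverage, gold and silver precision of an extracted lexicon.

    annotated_candidates maps a source term to a list of
    (translation, label) pairs, where label is one of
    "canonical", "recoverable" or "incorrect".
    """
    if not source_terms:
        raise ValueError("source_terms must not be empty")
    # Source terms for which at least one candidate was generated.
    translated = [t for t in source_terms if annotated_candidates.get(t)]
    coverage = len(translated) / len(source_terms)

    def precision(accepted_labels):
        # Assumption: a term counts as correct if any candidate is accepted.
        if not translated:
            return 0.0
        hits = sum(
            any(label in accepted_labels
                for _, label in annotated_candidates[t])
            for t in translated
        )
        return hits / len(translated)

    return {
        "coverage": coverage,
        "gold_precision": precision({"canonical"}),
        "silver_precision": precision({"canonical", "recoverable"}),
    }


# Toy data: four source terms, two of which received a candidate.
terms = ["post-menopausal", "oestrogen-sensitive", "term-3", "term-4"]
candidates = {
    "post-menopausal": [("post-ménopausique", "canonical")],
    "oestrogen-sensitive": [("après la ménopause", "incorrect"),
                            ("sensible à l'œstrogène", "recoverable")],
}
scores = lexicon_scores(terms, candidates)
```

Under these assumptions silver-standard precision can never be lower than gold-standard precision, which is consistent with the figures reported above.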

Table 4: Scores for the EN→FR lexicon

Table 6: Scores without (-f) and with (+f) fertile
translations (EN→FR)

We also tested the impact of the fertile
translations on the quality of the lexicon. Tables
6 and 7 show the evaluation scores with and without
fertile translations. As expected, fertile
translations enable us to increase the size of the
lexicon but they are less accurate than non-fertile
translations. Fertile translations increase the
overall quality of the English-French lexicon by 4%
to 5%. This is not the case for English-German
translations: fertile translations result in a big
drop in precision. The overall quality does not
significantly change.
This might be partly due to the low comparability
of the corpus but we think that the main reason
lies in the morphological type of the languages
involved in the translation. It is worth noting
that, if we consider only the non-fertile
translations, the English-German lexicon has
generally better scores than the English-French
one. In fact, fertile variants are more natural and
frequent in French than in German. English and
German are Germanic languages with a tendency to
build new words by agglutinating words or morphemes
into one single word. Noun compounds such as
oestrogen-independent or Östrogen-unabhängige are
common in these two languages. Conversely, French
is a Romance language which prefers to use phrases
composed of two nouns and a preposition rather than
a single-noun compound (oestrogen-independent would
be translated as indépendant des œstrogènes
'independent of oestrogens'). It is the same with
the bound morpheme/single word alternation. The
term cytoprotection will be translated into German
as Zellschutz whereas in French it can be
translated as cytoprotection or protection de la
cellule 'protection of the cell'.
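
The cytoprotection example can be made concrete with the sketch below. It is not the prototype evaluated in this paper: the morpheme table, the stop-word list and the matching strategy (components either concatenated into one word, or realised as separate words with a few stop words in between) are assumptions made for illustration only.

```python
import re
from itertools import permutations, product

# Hypothetical EN -> FR equivalences: a source component may map to a
# bound morpheme ("cyto") and/or a free word ("cellule").
COMPONENT_TRANSLATIONS = {
    "cyto": ["cyto", "cellule"],
    "protection": ["protection"],
}
# Hypothetical stop words allowed between the parts of a fertile translation.
STOP_WORDS = {"de", "du", "des", "la", "le", "les", "aux", "à"}


def match_phrase(words, tokens, max_gap):
    """Find words in order in tokens, skipping up to max_gap stop words."""
    for start, token in enumerate(tokens):
        if token != words[0]:
            continue
        pos = start
        for word in words[1:]:
            pos += 1
            gap = 0
            while (pos < len(tokens) and tokens[pos] in STOP_WORDS
                   and gap < max_gap):
                pos += 1
                gap += 1
            if pos >= len(tokens) or tokens[pos] != word:
                break
        else:
            return " ".join(tokens[start:pos + 1])
    return None


def find_candidates(components, target_text, max_gap=2):
    """Return a set of (translation, is_fertile) attested in target_text."""
    tokens = re.findall(r"\w+", target_text.lower())
    vocabulary = set(tokens)
    found = set()
    options = [COMPONENT_TRANSLATIONS[c] for c in components]
    for choice in product(*options):
        for order in permutations(choice):
            # Non-fertile: all components fused into a single target word.
            word = "".join(order)
            if word in vocabulary:
                found.add((word, False))
            # Fertile: more target words than source words.
            if len(order) > 1:
                phrase = match_phrase(order, tokens, max_gap)
                if phrase:
                    found.add((phrase, True))
    return found


# Invented target sentence standing in for the French corpus.
corpus = "La cytoprotection, ou protection de la cellule, est étudiée."
candidates = find_candidates(["cyto", "protection"], corpus)
```

On this toy sentence the sketch retrieves both the single-word variant and the phrasal one, which mirrors the bound morpheme/single word alternation described above. A German target corpus would be expected to attest mostly the fused form.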



Table 7: Scores without (-f) and with (+f) fertile
translations (EN→DE)

6 Conclusion and future work 

We have proposed a method based on the
compositionality principle which can extract
translation pairs from comparable corpora. It is
capable of dealing with a large variety of
morphologically constructed terms and can generate
fertile translations. The added value of the
fertile translations is clear-cut for English to
French translation but not for English to German
translation. The English-German lexicon is better
without the fertile translations. It seems that the
added value of fertile translations depends on the
morphological type of the languages involved in
translation. Future work includes the improvement
of the identification of morphological variants.
The morphological families extracted by the
stemming algorithm are too broad for the purpose of
translation. For example, the words desirability
and desiring have the same stem but they are too
distant semantically to be used to generate
translation variants. We need to restrict the
morphological families to a small set of
morphological relations (e.g. noun ↔ relational
adjective links). We will also work out a way to
rank the candidate translations. Several lines of
research are possible: going beyond the target
corpora and learning a language model from a larger
target corpus, mixing compositional translation
with a context-based approach, or learning
translation probabilities for part-of-speech
patterns from a parallel corpus (e.g. learning that
it is more probable that a noun is translated as
another noun or as a noun phrase rather than an
adverb). A last improvement could be to gather
morpheme correspondences from parallel data.
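
The restriction of morphological families mentioned above could look like the following sketch. It is only an illustration of the idea: the crude suffix stripper stands in for the stemming algorithm (it is not the one we used), and the suffix list, the whitelist of relations and the hormone/hormonal pair are hypothetical.

```python
# Hypothetical suffix list for a deliberately crude stemmer (longest first).
SUFFIXES = ["ability", "ing", "al", "e"]
# Hypothetical whitelist of suffix pairs. ("e", "al") stands for a
# noun <-> relational adjective link such as hormone / hormonal.
ALLOWED_RELATIONS = {("e", "al")}


def split_stem(word):
    """Split a word into (stem, suffix) using the toy suffix list."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""


def same_family(word_a, word_b):
    """Broad family: the two words merely share a stem."""
    return split_stem(word_a)[0] == split_stem(word_b)[0]


def usable_variants(word_a, word_b):
    """Restricted family: shared stem AND a whitelisted relation."""
    stem_a, suffix_a = split_stem(word_a)
    stem_b, suffix_b = split_stem(word_b)
    if stem_a != stem_b:
        return False
    return ((suffix_a, suffix_b) in ALLOWED_RELATIONS
            or (suffix_b, suffix_a) in ALLOWED_RELATIONS)


# desirability / desiring share a stem but are not linked by an allowed
# relation, so they would no longer be used as translation variants.
broad = same_family("desirability", "desiring")
restricted = usable_variants("desirability", "desiring")
```

With such a filter, stem sharing remains a necessary condition for two words to be treated as variants, but it is no longer a sufficient one.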